This tutorial will demonstrate how to analyse audio data for acoustic phonetic studies in R. It is mainly intended to demonstrated possible workflows. The topics covered are:

Preamble

The generated HTML page of this tutorial is available here.

This tutorial is organised as an R Markdown notebook. To execute the code within this notebook (filename Lecture.Rmd) it has to be opend in RStudio. When executing code, the results appear beneath the code. In order to do so, you need to have R and RStudio installed. When this R notebook is loaded into RStudio, you can excecute chunks by clicking the Run button within the chunk or by placing your cursor inside it and pressing Cmd+Shift+Enter, allowing you to experiment with the code. This handout contains all output of the code (tables, visualisations etc.). The easiest way to work with this R code is to clone the entire project from this GitHub repository.

The following libraries will be needed and have to be installed.

library(tidyverse)
library(ggiraph)
library(cowplot)
library(emuR)
library(tools)
library(rio)
library(ggplot2)
library(magrittr)
library(ggiraph)
library(htmltools)
library(shiny)
library(joeyr)
library(knitr)

1 The LOD database

The database contains the audio recordings from the Lëtzebuerger Online Dictionnaire (available here, spoken by one female speaker. The audio files have been automatically segmented with the MAUS tools. We thus have a database conisting of textual data, basically words, and the corresponding audio data. The audio data is segmented into words and phonetic segments (sounds).

This database has been created beforehand. Infos how to create such a database is explained in the EMU-SDMS manual.

While we will be working with the full database (26.000 recordings), a small demo database lod_emuDB has been created with 500 recordings which can be used for individual testing.

We start with loading the database and give an overview of structure and content.

# load database
db = load_emuDB("/Users/peter.gilles/Documents/_Daten/LOD-emuDB/lod_emuDB")
INFO: Checking if cache needs update for 1 sessions and 26093 bundles ...
INFO: Performing precheck and calculating checksums (== MD5 sums) for _annot.json files ...
INFO: Nothing to update!
#db = load_emuDB("lod_emuDB")

# display the overview of the structure and content
summary(db)
Name:    lod 
UUID:    c87793c0-012a-11e9-874b-68b599b5deb4 
Directory:   /Users/peter.gilles/Documents/_Daten/LOD-emuDB/lod_emuDB 
Session count: 1 
Bundle count: 26093 
Annotation item count:  515638 
Label count:  639542 
Link count:  489545 

Database configuration:

SSFF track definitions:
      name columnName fileExtension
1      dft        dft           dft
2 praatFms         fm      praatFms
3     dft2        dft           dft

Level definitions:
    name    type nrOfAttrDefs              attrDefNames
1 bundle    ITEM            4 bundle; source; SAM; MAO;
2    ORT    ITEM            2                 ORT; KAN;
3    MAU SEGMENT            1                      MAU;

Link definitions:
         type superlevelName sublevelName
1 ONE_TO_MANY         bundle          ORT
2 ONE_TO_MANY            ORT          MAU

Tracks in emuR are acoustic representation of the speech signal, here dft for the waveform (time-amplitude representation) and praatFms for the formant measures of vowels (see below).

Levels in an eumR database stand for level of interlinked linguistic information. bundle is the entire audio file, ORT stands for the orthographical representation of the audio file segmented in its single words. MAU is the segmentation of all phonetic segment (=sounds) of all ORT segments in bundle.

The hierarchical structure of these levels is expressed in the link definitions as ONE-TO-MANY.

2 Database queries

An emuR database can be queried with a powerful query engine. The first example is a simple query for one word, Aarbecht.

sl = query(db, query = "[ ORT == 'Aarbecht']")
sl

The result is a segment list (sl), containing various information about the found item (time, level, name, database info). The result of the query can also be displayed in the EMU Speech Database Management System.

serve(db, seglist = sl)

The GUI will open in the Viewer pane of RStudio or you can open it in a browser (Chrome preferred).

Here we can also display the hiearchical structure for this database item, which is accessed during queries. bundle is the top-level, representing the entire audio file.

The ORT level contains the nodes for the individual words in the bundle, here the two words Aarbecht and Aarbechten. The dependend level then is MAU (=Munich Automatic Unit) representing the single sounds of the words in ORT.

Two aspects render emuR query system extremly powerful: the use of regular expressions (including negation and other extensions) and the combinated query on different levels of the database.

Let’s try more complex queries:

sl = query(db, query = "[ ORT =~ 'Aarbecht.*']")
sl

Select the vowel [aː] in all words beginning with Aarbecht… Note that in the segment list the label now has changed to the vowel and the respective start-end information is now only for this sound [aː].

sl = query(db, query = "[ ORT =~ 'Aarbecht.*' ^ #MAU=='aː']")
sl
sl = query(db, query = "[ MAU == e -> MAU == k ]")
# print only the first 100 rows
sl[1:100, ]
sl = query(db, query = "[ #MAU == e -> MAU == k ]")
# print only the first 100 rows
sl[1:100, ]
# query all sound items that occur
# at the end of a word and are `p`
sl <- query(db, query = "[End(ORT, MAU) == TRUE & MAU == p ]")
sl
# retrieve all word that contain five segments
sl <- query(db, "[Num(ORT, MAU) == 5]")
sl
# requery_seq()

# query all "m" phonetic items
sl_m = query(db, "MAU == m")

# sequential requery (left shift result by 1 (== offset of -1))
# and hence retrieve all phonetic items directly preceeding
# all "m" phonetic items
sl_req_n <- requery_seq(db, 
            seglist = sl_m, 
            offset = -1)
sl_req_n
# query all "m" phonetic items
sl = query(db, "MAU == longMonophthongs")
sl

With the query the user can compile the data frame from the database which then forms the subset for the phonetic analysis. We can select e.g. all instances of certain (or all) vowels, specifying the context before or after etc. etc.

Of course, querying for individual segments in the audio file like words or sounds is possible only, if this information has been added to the database before.

3 Signal processing in R

The first task is to extract the time-amplitude waveform representation (oscillogram) for certain single sounds, e.g. some long .

# query all "aː" phonetic items
# 
sl = query(db, "MAU == aː")
# instead of all 7,000 vowels take only 6
sl = sl[100:105, ]

The data for the oscillogram is stored in the SSF track dft, which is extracted by get_trackdata. The result is another R dataframe.

# get "dft" track data for these segments
a_vowels = get_trackdata(emuDBhandle = db,
                         seglist = sl,
                         ssffTrackName = "MEDIAFILE_SAMPLES",
                         verbose = TRUE)
Warning in get_trackdata(emuDBhandle = db, seglist = sl, ssffTrackName = "MEDIAFILE_SAMPLES", : The emusegs/emuRsegs object passed in refers to bundles with in-homogeneous sampling rates in their audio files! Here is a list of all refered to bundles incl. their sampling rate: 
  session              name             annotates sample_rate
1    0000    AARBECHTSPLAZ1    AARBECHTSPLAZ1.wav       48000
2    0000 AARBECHTSPROZESS1 AARBECHTSPROZESS1.wav       48000
3    0000   AARBECHTSRECHT1   AARBECHTSRECHT1.wav       44100
                    md5_annot_json
1 a4df301bb5ca34b46a0b8438677334d9
2 6156d1a162658be190fa7568a00a6223
3 ef1565335d9bd950a71c7ff9c4d06540

  INFO: parsing 6 wav segments/events

  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |============                                                          |  17%
  |                                                                            
  |=======================                                               |  33%
  |                                                                            
  |===================================                                   |  50%
  |                                                                            
  |===============================================                       |  67%
  |                                                                            
  |==========================================================            |  83%
  |                                                                            
  |======================================================================| 100%

The oscillograms for these 6 vowel instances can then be visualised with ggplot2.

# plot oscillogram
ggplot(data = a_vowels) + 
  # define the df columns for the x and y values
  aes(y = T1, x = times_rel) + 
  # line chart
  geom_line() + 
  # sl_rowIdx groups all rows in a dataframe belonging to the same segment
  # labels contains the label of the segment
  facet_wrap(~ sl_rowIdx + labels)

# query all words beginning with 'Fluch'
# 
sl = query(db, query = "[ ORT =~ 'Fluch.*']")

# get "f0" track data for these segments, in this case calculated on the fly
fluch_words = get_trackdata(emuDBhandle = db,
                         seglist = sl,
                         # using emuR's Michel Scheffers’ Modified Harmonic
                         # Sieve algorithm 
                         onTheFlyFunctionName = "mhsF0",
                         verbose = TRUE)
Warning in get_trackdata(emuDBhandle = db, seglist = sl, onTheFlyFunctionName = "mhsF0", : The emusegs/emuRsegs object passed in refers to bundles with in-homogeneous sampling rates in their audio files! Here is a list of all refered to bundles incl. their sampling rate: 
   session               name              annotates sample_rate
1     0000             FLUCH1             FLUCH1.wav       44100
2     0000             FLUCH2             FLUCH2.wav       44100
3     0000      FLUCHBILLJEE1      FLUCHBILLJEE1.wav       44100
4     0000         FLUCHBLAT1         FLUCHBLAT1.wav       44100
5     0000         FLUCHFELD1         FLUCHFELD1.wav       44100
6     0000 FLUCHGESELLSCHAFT1 FLUCHGESELLSCHAFT1.wav       44100
7     0000        FLUCHHAFEN1        FLUCHHAFEN1.wav       44100
8     0000        FLUCHROUTE1        FLUCHROUTE1.wav       48000
9     0000       FLUCHSCHAIN1       FLUCHSCHAIN1.wav       48000
10    0000            FLUCHT1            FLUCHT1.wav       44100
11    0000       FLUCHTGEFOR1       FLUCHTGEFOR1.wav       44100
12    0000       FLUCHTICKET1       FLUCHTICKET1.wav       44100
                     md5_annot_json
1  4f96300f1fd30205575bd557c6464ed5
2  d4d6ab259206af2592edab568a1f3169
3  2c9d18baef54d1ed2dfe97af377dbb47
4  d21730527f22712aaa0e002341f3831c
5  3289c832ff6ff8b42d1803163d544cce
6  ef31e44d088bf2461cdae5284db0b1c5
7  44ca764332ca16eb00f72978113cd908
8  2d055afc598b0f4c44fa38afd7b295da
9  82e2fff18e70a74058e978932dea9e50
10 17ecb177ca527fc055c778f8708be1ae
11 836a096c2af4b9dd1acdfe6163e3570e
12 6896b5678da288e71af86bf414c67fb1

  INFO: applying mhsF0 to 12 segments/events

  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |======                                                                |   8%
  |                                                                            
  |============                                                          |  17%
  |                                                                            
  |==================                                                    |  25%
  |                                                                            
  |=======================                                               |  33%
  |                                                                            
  |=============================                                         |  42%
  |                                                                            
  |===================================                                   |  50%
  |                                                                            
  |=========================================                             |  58%
  |                                                                            
  |===============================================                       |  67%
  |                                                                            
  |====================================================                  |  75%
  |                                                                            
  |==========================================================            |  83%
  |                                                                            
  |================================================================      |  92%
  |                                                                            
  |======================================================================| 100%

In the corresponding graphs then the track for the fundamental frequency (f0) for some isolated words will be drawn.

# plot f0 tracks
ggplot(data = fluch_words) + 
  # define the df columns for the x and y values
  # T1 here is the f0 value
  aes(y = T1, x = times_rel) + 
  # line chart
  geom_line() + 
  # sl_rowIdx groups all rows in a dataframe belonging to the same segment
  # labels contains the label of the segment
  facet_wrap(~ sl_rowIdx + labels)

If needed, you can check the content of your segment list with the EMU WebApp running serve(seglist = sl).

4 Calculate duration for vowel categories

The study of duration of speech segments is a standard task in acoustic phonetics. Let’s see how to solve this in R. Thanks to the label groups which are available in the database, we have easy access to e.g. all shortMonophthongs and all longMonophthongs.

# query all "m" phonetic items
longMonophthongs = query(db, "MAU == longMonophthongs")
shortMonophthongs = query(db, "MAU == shortMonophthongs")
sl = rbind(longMonophthongs, shortMonophthongs)

# print the count; number of rows
nrow(longMonophthongs)
[1] 22950
nrow(shortMonophthongs)
[1] 104115

The data frame contains per segment a start and end column, which allows for simple calculation of the segment duration.

# calculate durations
duration_long = longMonophthongs$end - longMonophthongs$start
duration_short = shortMonophthongs$end - shortMonophthongs$start

# calculate mean and standard deviation
paste("The mean duration for long monophthongs in the databse is", round(mean(duration_long)), "ms. The standard deviation is:", sd(duration_long), ".")
[1] "The mean duration for long monophthongs in the databse is 153 ms. The standard deviation is: 62.9196191781937 ."
The mean duration for long monophthongs in the databse is 153 ms. The standard deviation is: 62.9196191781937 .
paste("The mean duration for short monophthongs in the databse is", round(mean(duration_short)), "ms. The standard deviation is:", sd(duration_short), ".")
[1] "The mean duration for short monophthongs in the databse is 106 ms. The standard deviation is: 51.4225215439996 ."
The mean duration for short monophthongs in the databse is 106 ms. The standard deviation is: 51.4225215439996 .

Instead of having these broad categories long and short we can break them down to the individual vowels. The dataframes just created contain the individual vowels already in the labels column and can thus be easily utilised.

head(longMonophthongs)

First, we create a aggregated datafrom with all vowel segments. This aggregated dataframe now contains 127,065 rows (= long/short vowels).

# add new columns for length, either long or short
longMonophthongs$length <- "long"
shortMonophthongs$length <- "short"

# then merge the two data frame into a single one
all_vowels <- rbind(longMonophthongs, shortMonophthongs)

# add a new colum for segment duration
all_vowels$duration <- round(all_vowels$end - all_vowels$start)

# todo: keep proper sl throughout, will be needed later for requeries!
sl <- all_vowels %>% filter(!str_detect(bundle, "^[CDDEF].*"))
sl$length <- NULL
sl$duration <- NULL

Using the powerful library dplyr, we can now calculate the means for all vowels (= labels).

duration_vowels <- all_vowels %>%
  group_by(labels, length) %>%
  summarise(mean = mean(duration), sd = sd(duration))

duration_vowels

Finally, visualise the means in a bar chart

# basic setting for ggplot with data and aesthetics
ggplot(data = duration_vowels, aes(x= labels, y= mean, fill = length)) +
  geom_bar(stat="identity", position=position_dodge()) +
  # add errorbars
  geom_errorbar(aes(ymin=mean-sd, ymax=mean+sd), width=.2,
                 position=position_dodge(.9)) +
  theme_minimal()

From here on the duration values can be utilised in further statistical analyses.

5 Vowels formants and visualisations with ggplot2

Formant frequencies are the acoustic representation of the vowel’s quality. Of these, the first two formants are crucial for the vowels identity, i.e. F1 and F2. The values for the formants are already calculated in the database and will now be extracted for all the vowels our dataframe all_vowels.

As it takes long to do this, the dataframe has been precompiled and saved to file … and reloaded again.

# saveRDS(td_vowels, file ="td_vowels.RDS")
# td_vowels <- readRDS("td_vowels.RDS")

Due to the varying length of all vowels, it is necessary to normalize the length with normalize_length. This step takes long and, again, the dataframe has already been created and saved.

# normalise the length
# td_vowels_norm = normalize_length(td_vowels %>% select(-duration, -length), colNames = c("T1", "T2"))

# do some cleaning
# exclude words spoken by another (male) speaker
# td_vowels_norm <- td_vowels_norm %>%
#  filter(!str_detect(bundle, "^[CDDEF].*"))

# save this dataframe 
#saveRDS(td_vowels_norm, file ="td_vowels_norm.RDS")

# load this data frame
td_vowels_norm <- readRDS("td_vowels_norm.RDS")

Formants are calculated for every sampling point of the audio file. Thus, the dataframe td_vowels_norm now contains some 2,1 million rows/values.

For illustration purposes, the formants for the long vowels in three example words are selected into a dataframe (maachen, Viz, Kuuscht).

formant_examples <- td_vowels_norm %>%
  filter(sl_rowIdx == "722" | sl_rowIdx == "21702" | sl_rowIdx == "11191")

These three vowels then are visualised as line charts.

ggplot(data=formant_examples) +
  geom_smooth(aes(x = times_norm, y = T1, col = labels, group = labels)) +
  geom_smooth(aes(x = times_norm, y = T2, col = labels, group = labels)) +
  labs(x = "Duration (normalized)", y = "F1 + F2 (Hz)") +
  ggtitle("Example formants F1, F2 the three corner vowels") +
  facet_wrap(~ labels)

To get the identity of a vowel, it is common practise to select only one point in the middle of the vowel. To do so, I am using a filter and select the F1 and F2 values at the normalised time point 0.4. This gives us F1 and F2 values for 104,055 vowels.

vowel_midpoints = td_vowels_norm %>% 
  filter(times_norm == 0.4)
vowel_midpoints$labels <-   as.factor(vowel_midpoints$labels)

Again, some further data conversion.

# convert Hertz to Bark
# convert T1, T2, T3 to Bark
td_vowels_norm$T1 <- emuR::bark(td_vowels_norm$T1)
td_vowels_norm$T2 <- emuR::bark(td_vowels_norm$T2)

# insert new columns from requeries to have the info about the next and next_next and previous label after the vowel
vowel_midpoints$next_label = as.factor(requery_seq(db, seglist = sl, offset=1)$labels)
vowel_midpoints$next_next_label = as.factor(requery_seq(db, seglist = sl, offset=2, ignoreOutOfBounds =TRUE)$labels)
vowel_midpoints$word = requery_hier(db, seglist = sl, level="ORT")$labels
vowel_midpoints$previous_label <- as.factor(requery_seq(db, seglist = sl, offset= -1)$labels)

The dataframe vowel_midpoints is now enriched with additional columns for the actual word (of a vowel instance), previous_label, next_label and next_next_label, which allows to select data more fine-grained for the following visualisations, e.g. all vowels with a g in front and an R after it.

head(vowel_midpoints)

Prepared in such a way, the visualisations of this sheer amount of vowels can commence. Vowel quality is charted in scatter plots, where the F1 value is plotted on the y-axis and the F2 value on the y-axis. To gain resemblance with the articulation of vowels, both axes are reversed. Thus i is located in the top front, u in the top back and a in the low position.

The result looks rather flashy, but is also flawd. First, there are several outliers

ggplot(vowel_midpoints) +
  aes(x = T2, y = T1, label = labels, col = labels) +
  geom_text() +
  scale_y_reverse() + 
  scale_x_reverse(breaks = seq(from=7, to=14, by=1), minor_breaks = NULL) + 
  labs(x = "F2 (Bark)", y = "F1 (Bark)") +
  #theme(legend.position = "none")
  ggtitle("Scatter plot of all vowels in the LOD database")

6 Calculate the Pillai distance

7 Vowel explorer

Try it here.

knitr::include_app("https://petergill.shinyapps.io/shinyplay/")
---
subtitle: "Lecture for class: Data science in the Humanities"
title: "Big data in the acoustic phonetic analysis"

# needed for github
knit: (function(input_file, encoding) {
  out_dir <- 'docs';
  rmarkdown::render(input_file,
 encoding=encoding,
 output_file=file.path(dirname(input_file), out_dir, 'index.html'))})
 
author: "Peter Gilles"
url: "https://twitter.com/PeterGilles"

date: "29. April 2020, 14h00 - 16h00, University of Luxembourg"
output:
  #tufte::tufte_html:
  #tufte::tufte_handout:
  #pdf_document:
  #  latex_engine: xelatex
  #  toc: true
  #  toc_depth: 2
  #  number_sections: true
  html_notebook: 
    toc: true
    toc_depth: 2
    number_sections: true
---

This tutorial will demonstrate how to analyse audio data for acoustic phonetic studies in R. It is mainly intended to demonstrated possible workflows. The topics covered are:

* phonetic databases, the case of `emuR` and `EMU-SDMS`, the EMU Speech Database Management System
* Sample data: the LOD database
* Queries and requeries
* Inscpect the database: `serve()`
* Calculate duration for vowel categories
* Vowels formants and visualisations with `ggplot2`
* Vowel explorer
* Calculating the Pillai distance

# Preamble {-}

The generated HTML page of this tutorial is available [here](https://petergilles.github.io/Data_Science_Humanities/index.nb.html).

This tutorial is organised as  an [R Markdown](http://rmarkdown.rstudio.com) notebook. To execute the code within this notebook (filename `Lecture.Rmd`) it has to be opend in RStudio. When executing code, the results appear beneath the code. In order to do so, you need to have R and RStudio installed. When this R notebook is loaded into RStudio, you can excecute chunks by clicking the *Run* button within the chunk or by placing your cursor inside it and pressing *Cmd+Shift+Enter*, allowing you to experiment with the code. This handout contains all output of the code (tables, visualisations etc.). The easiest way to work with this R code is to clone the entire project from this GitHub repository.

The following libraries will be needed and have to be installed.
```{r Libraries, message=FALSE, warning=FALSE}
library(tidyverse)
library(ggiraph)
library(cowplot)
library(emuR)
library(tools)
library(rio)
library(ggplot2)
library(magrittr)
library(ggiraph)
library(htmltools)
library(shiny)
library(joeyr)
library(knitr)
```


# The LOD database

The database contains the audio recordings from the `Lëtzebuerger Online Dictionnaire` (available [here](https://github.com/spellchecker-lu), spoken by one female speaker. The audio files have been automatically segmented with the [MAUS tools](https://clarin.phonetik.uni-muenchen.de/BASWebServices/interface). We thus have a database conisting of textual data, basically words, and the corresponding audio data. The audio data is segmented into words and phonetic segments (sounds).

This database has been created beforehand. Infos how to create such a database is explained in the [EMU-SDMS manual](https://ips-lmu.github.io/The-EMU-SDMS-Manual/).

While we will be working with the full database (26.000 recordings), a small demo database `lod_emuDB` has been created with 500 recordings which can be used for individual testing.

We start with loading the database and give an overview of structure and content.
```{r}
# load database
db = load_emuDB("/Users/peter.gilles/Documents/_Daten/LOD-emuDB/lod_emuDB")
#db = load_emuDB("lod_emuDB")

# display the overview of the structure and content
summary(db)

```

Tracks in `emuR` are acoustic representation of the speech signal, here `dft` for the waveform (time-amplitude representation) and `praatFms` for the formant measures of vowels (see below). 

Levels in an `eumR` database stand for level of interlinked linguistic information. `bundle` is the entire audio file, `ORT` stands for the orthographical representation of the audio file segmented in its single words. `MAU` is the segmentation of all phonetic segment (=sounds) of all `ORT` segments in `bundle`.

The hierarchical structure of these levels is expressed in the `link definitions` as `ONE-TO-MANY`.

# Database queries

An emuR database can be queried with a powerful query engine. The first example is a simple query for one word, `Aarbecht`.

```{r}
sl = query(db, query = "[ ORT == 'Aarbecht']")
sl
```

The result is a `segment list` (`sl`), containing various information about the found item (time, level, name, database info). The result of the query can also be displayed in the EMU Speech Database Management System. 

`serve(db, seglist = sl)`

The GUI will open in the `Viewer` pane of RStudio or you can open it in a browser (Chrome preferred).

```{r echo=FALSE}
knitr::include_graphics(rep("emu-sdms.png"))
```

Here we can also display the hiearchical structure for this database item, which is accessed during queries. `bundle` is the top-level, representing the entire audio file. 

The `ORT` level contains the nodes for the individual words in the `bundle`, here the two words `Aarbecht` and `Aarbechten`. The dependend level then is `MAU` (=`Munich Automatic Unit`) representing the single sounds of the words in `ORT`. 

```{r echo=FALSE}
knitr::include_graphics(rep("hierarchy.png"))
```

Two aspects render emuR query system extremly powerful: the use of regular expressions (including negation and other extensions) and the combinated query on different levels of the database.

Let's try more complex queries:

- regular expression, operator `=~`, words beginning with `Aarbecht...`
```{r}
sl = query(db, query = "[ ORT =~ 'Aarbecht.*']")
sl
```

Select the vowel [aː] in all words beginning with `Aarbecht`... Note that in the segment list the label now has changed to the vowel and the respective start-end information is now only for this sound [aː].

```{r}
sl = query(db, query = "[ ORT =~ 'Aarbecht.*' ^ #MAU=='aː']")
sl

```

- query a sequence of sounds, e.g. `e` followed by `k` (`Méck`), by using the sequence operator `->`.

```{r}
sl = query(db, query = "[ MAU == e -> MAU == k ]")
# print only the first 100 rows
sl[1:100, ]

```

- in the sequence `e->k`, query only the vowel. Use the result modifier `#`.

```{r}
sl = query(db, query = "[ #MAU == e -> MAU == k ]")
# print only the first 100 rows
sl[1:100, ]
```

```{r}
# query all sound items that occur
# at the end of a word and are `p`
sl <- query(db, query = "[End(ORT, MAU) == TRUE & MAU == p ]")
sl
```

```{r}
# retrieve all word that contain five segments
sl <- query(db, "[Num(ORT, MAU) == 5]")
sl
```

* using `requery` the results from a previous query can be further specified

```{r}
# requery_seq()

# query all "m" phonetic items
sl_m = query(db, "MAU == m")

# sequential requery (left shift result by 1 (== offset of -1))
# and hence retrieve all phonetic items directly preceeding
# all "m" phonetic items
sl_req_n <- requery_seq(db, 
            seglist = sl_m, 
            offset = -1)
sl_req_n
```

* groups auf sounds can be grouped together to `label groups`, e.g. all long monophthongs (`"iː", "uː", "aː", "oː", "ɔː", "ɛː", "eː"`) or all short monophthongs (`"i", "u", "ɑ", "o", "æ", "e", "ə", "ɐ"`).

```{r}
# query all "m" phonetic items
sl = query(db, "MAU == longMonophthongs")
sl
```

With the query the user can compile the data frame from the database which then forms the subset for the phonetic analysis. We can select e.g. all instances of certain (or all) vowels, specifying the context before or after etc. etc.

Of course, querying for individual segments in the audio file like words or sounds is possible only, if this information has been added to the database before.


# Signal processing in R

The first task is to extract the time-amplitude waveform representation (oscillogram) for certain single sounds, e.g. some long `aː`.

```{r}
# query all "aː" phonetic items
# 
sl = query(db, "MAU == aː")
# instead of all 7,000 vowels take only 6
sl = sl[100:105, ]

```

The data for the oscillogram is stored in the SSF track `dft`, which is extracted by `get_trackdata`. The result is another R dataframe.

```{r}
# get "dft" track data for these segments
a_vowels = get_trackdata(emuDBhandle = db,
                         seglist = sl,
                         ssffTrackName = "MEDIAFILE_SAMPLES",
                         verbose = TRUE)
```

The oscillograms for these 6 vowel instances can then be visualised with `ggplot2`.

```{r}
# plot oscillogram
ggplot(data = a_vowels) + 
  # define the df columns for the x and y values
  aes(y = T1, x = times_rel) + 
  # line chart
  geom_line() + 
  # sl_rowIdx groups all rows in a dataframe belonging to the same segment
  # labels contains the label of the segment
  facet_wrap(~ sl_rowIdx + labels)
```

```{r}
# query all words beginning with 'Fluch'
# 
sl = query(db, query = "[ ORT =~ 'Fluch.*']")

# get "f0" track data for these segments, in this case calculated on the fly
fluch_words = get_trackdata(emuDBhandle = db,
                         seglist = sl,
                         # using emuR's Michel Scheffers’ Modified Harmonic
                         # Sieve algorithm 
                         onTheFlyFunctionName = "mhsF0",
                         verbose = TRUE)
```

In the corresponding graphs then the track for the fundamental frequency (f0) for some isolated words will be drawn.

```{r}
# plot f0 tracks
ggplot(data = fluch_words) + 
  # define the df columns for the x and y values
  # T1 here is the f0 value
  aes(y = T1, x = times_rel) + 
  # line chart
  geom_line() + 
  # sl_rowIdx groups all rows in a dataframe belonging to the same segment
  # labels contains the label of the segment
  facet_wrap(~ sl_rowIdx + labels)
```

If needed, you can check the content of your segment list with the EMU WebApp running `serve(seglist = sl)`.

# Calculate duration for vowel categories

The study of duration of speech segments is a standard task in acoustic phonetics. Let's see how to solve this in `R`. Thanks to the `label groups` which are available in the database, we have easy access to e.g. all `shortMonophthongs` and all `longMonophthongs`.

```{r}
# query all "m" phonetic items
longMonophthongs = query(db, "MAU == longMonophthongs")
shortMonophthongs = query(db, "MAU == shortMonophthongs")
sl = rbind(longMonophthongs, shortMonophthongs)

# print the count; number of rows
nrow(longMonophthongs)
nrow(shortMonophthongs)
```

The data frame contains per segment a start and end column, which allows for simple calculation of the segment duration.

```{r}
# calculate durations
duration_long = longMonophthongs$end - longMonophthongs$start
duration_short = shortMonophthongs$end - shortMonophthongs$start

# calculate mean and standard deviation
paste("The mean duration for long monophthongs in the databse is", round(mean(duration_long)), "ms. The standard deviation is:", sd(duration_long), ".")

paste("The mean duration for short monophthongs in the databse is", round(mean(duration_short)), "ms. The standard deviation is:", sd(duration_short), ".")
```

Instead of having these broad categories `long` and `short` we can break them down to the individual vowels. The dataframes just created contain the individual vowels already in the `labels` column and can thus be easily utilised.

```{r}
head(longMonophthongs)
```

First, we create a aggregated datafrom with all vowel segments. This aggregated dataframe now contains 127,065 rows (= long/short vowels).

```{r}
# add new columns for length, either long or short
longMonophthongs$length <- "long"
shortMonophthongs$length <- "short"

# then merge the two data frame into a single one
all_vowels <- rbind(longMonophthongs, shortMonophthongs)

# add a new colum for segment duration
all_vowels$duration <- round(all_vowels$end - all_vowels$start)

# todo: keep proper sl throughout, will be needed later for requeries!
sl <- all_vowels %>% filter(!str_detect(bundle, "^[CDDEF].*"))
sl$length <- NULL
sl$duration <- NULL
```

Using the powerful library `dplyr`, we can now calculate the means for all vowels (= `labels`).

```{r}
duration_vowels <- all_vowels %>%
  group_by(labels, length) %>%
  summarise(mean = mean(duration), sd = sd(duration))

duration_vowels
```

Finally, visualise the means in a bar chart

```{r}
# basic setting for ggplot with data and aesthetics
ggplot(data = duration_vowels, aes(x= labels, y= mean, fill = length)) +
  geom_bar(stat="identity", position=position_dodge()) +
  # add errorbars
  geom_errorbar(aes(ymin=mean-sd, ymax=mean+sd), width=.2,
                 position=position_dodge(.9)) +
  theme_minimal()
```

From here on the duration values can be utilised in further statistical analyses.

# Vowels formants and visualisations with `ggplot2`

Formant frequencies are the acoustic representation of the vowel's quality. Of these, the first two formants are crucial for the vowels identity, i.e. F1 and F2. The values for the formants are already calculated in the database and will now be extracted for all the vowels our dataframe `all_vowels`.

```{r include=FALSE}
# get formant values for those segments
# either from pre-generated Praat-Track
#td_vowels = get_trackdata(db, all_vowels, 
#                          ssffTrackName = "praatFms", 
#                          resultType = "tibble")
```

As it takes long to do this, the dataframe has been precompiled and saved to file ... and reloaded again.
```{r echo=TRUE}
# saveRDS(td_vowels, file ="td_vowels.RDS")
# td_vowels <- readRDS("td_vowels.RDS")
```

Due to the varying length of all vowels, it is necessary to normalize the length with `normalize_length`. This step takes long and, again, the dataframe has already been created and saved.

```{r}
# normalise the length
# td_vowels_norm = normalize_length(td_vowels %>% select(-duration, -length), colNames = c("T1", "T2"))

# do some cleaning
# exclude words spoken by another (male) speaker
# td_vowels_norm <- td_vowels_norm %>%
#  filter(!str_detect(bundle, "^[CDDEF].*"))

# save this dataframe 
#saveRDS(td_vowels_norm, file ="td_vowels_norm.RDS")

# load this data frame
td_vowels_norm <- readRDS("td_vowels_norm.RDS")
```

Formants are calculated for every sampling point of the audio file. Thus, the dataframe `td_vowels_norm` now contains some 2,1 million rows/values.

For illustration purposes, the formants for the long vowels in three example words are selected into a dataframe (*maachen, Viz, Kuuscht*).

```{r}
formant_examples <- td_vowels_norm %>%
  filter(sl_rowIdx == "722" | sl_rowIdx == "21702" | sl_rowIdx == "11191")
```

These three vowels then are visualised as line charts.

```{r message=FALSE}
ggplot(data=formant_examples) +
  geom_smooth(aes(x = times_norm, y = T1, col = labels, group = labels)) +
  geom_smooth(aes(x = times_norm, y = T2, col = labels, group = labels)) +
  labs(x = "Duration (normalized)", y = "F1 + F2 (Hz)") +
  ggtitle("Example formants F1, F2 the three corner vowels") +
  facet_wrap(~ labels)
```

To get the identity of a vowel, it is common practise to select only one point in the middle of the vowel. To do so, I am using a `filter` and select the F1 and F2 values at the normalised time point `0.4`. This gives us F1 and F2 values for 104,055 vowels.

```{r}
vowel_midpoints = td_vowels_norm %>% 
  filter(times_norm == 0.4)
vowel_midpoints$labels <-   as.factor(vowel_midpoints$labels)
```

Again, some further data conversion.
```{r warning=FALSE}
# convert Hertz to Bark
# convert T1, T2, T3 to Bark
td_vowels_norm$T1 <- emuR::bark(td_vowels_norm$T1)
td_vowels_norm$T2 <- emuR::bark(td_vowels_norm$T2)

# insert new columns from requeries to have the info about the next and next_next and previous label after the vowel
vowel_midpoints$next_label = as.factor(requery_seq(db, seglist = sl, offset=1)$labels)
vowel_midpoints$next_next_label = as.factor(requery_seq(db, seglist = sl, offset=2, ignoreOutOfBounds =TRUE)$labels)
vowel_midpoints$word = requery_hier(db, seglist = sl, level="ORT")$labels
vowel_midpoints$previous_label <- as.factor(requery_seq(db, seglist = sl, offset= -1)$labels)
```

The dataframe `vowel_midpoints` is now enriched with additional columns for the actual `word` (of a vowel instance), `previous_label`, `next_label` and `next_next_label`, which allows to select data more fine-grained for the following visualisations, e.g. all vowels `aː` with a `g` in front and an `R` after it.

```{r}
head(vowel_midpoints)
```

Prepared in such a way, the visualisations of this sheer amount of vowels can commence. Vowel quality is charted in scatter plots, where the F1 value is plotted on the y-axis and the F2 value on the y-axis. To gain resemblance with the articulation of vowels, both axes are reversed. Thus `i` is located in the top front, `u` in the top back and `a` in the low position.

The result looks rather flashy, but is also flawd. First, there are several outliers

```{r}
ggplot(vowel_midpoints) +
  aes(x = T2, y = T1, label = labels, col = labels) +
  geom_text() +
  scale_y_reverse() + 
  scale_x_reverse(breaks = seq(from=7, to=14, by=1), minor_breaks = NULL) + 
  labs(x = "F2 (Bark)", y = "F1 (Bark)") +
  #theme(legend.position = "none")
  ggtitle("Scatter plot of all vowels in the LOD database")
```


# Calculate the Pillai distance

## from https://joeystanley.com/blog/a-tutorial-in-calculating-vowel-overlap

```{r echo=FALSE}
knitr::include_graphics(rep("pillai_example.png"))
```

# Vowel explorer

Try it [here](https://petergill.shinyapps.io/shinyplay/).

```{r}

knitr::include_app("https://petergill.shinyapps.io/shinyplay/")

```



